This document presents the work done for Assignment 3 in the Statistical Learning course at Universidad Carlos III de Madrid. The objective of this assignment is to implement the Expectation-Maximization (EM) algorithm for estimating the parameters of a Gaussian Mixture Model (GMM) across different data dimensions.
Specifically, the assignment requires:
- implementing the EM algorithm for GMMs on 1D, 2D, and 3D data;
- applying the algorithm to image segmentation by clustering pixels in RGB space;
- exploring whether EM, combined with PCA, can be used for classification.
This report details the methodology, results, and insights gained from the implementation and testing of the EM algorithm.
# Functions
## Initialization function
Initialization is the first step of the EM algorithm and is crucial: convergence depends heavily on the starting point, and the quality of the initial clusters influences the final solution. In this implementation, we initialize the cluster means by randomly selecting K data points, ensuring that the means start within a reasonable range. The covariance matrices are set to identity matrices to maintain numerical stability and prevent singularities. The mixing coefficients are initialized uniformly so that no cluster dominates at the start. This provides a simple and effective way to initialize EM without relying on K-means, making the procedure fully autonomous.
initialize_em_parameters = function(X, K) {
  N = nrow(X)  # Number of data points
  D = ncol(X)  # Number of dimensions
  # Randomly select K data points as initial means (drop = FALSE keeps a matrix when K = 1)
  mu = X[sample(1:N, K), , drop = FALSE]
  # Initialize covariance matrices as identity matrices
  sigma = array(diag(D), dim = c(D, D, K))
  # Initialize equal mixing coefficients
  pi_k = rep(1/K, K)
  return(list(mu = mu, sigma = sigma, pi_k = pi_k))
}
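As a quick smoke test (illustrative only; X_demo is synthetic data invented for this check), the returned structure has one mean per row, one covariance slice per cluster, and uniform weights:
set.seed(1)
X_demo = matrix(rnorm(100), ncol = 2)            # 50 points in 2 dimensions
params_demo = initialize_em_parameters(X_demo, K = 3)
dim(params_demo$mu)     # 3 x 2: one initial mean per cluster
dim(params_demo$sigma)  # 2 x 2 x 3: one identity covariance per cluster
params_demo$pi_k        # 1/3 1/3 1/3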
## Plotting functions
The helpers below compute 2D Gaussian ellipses and plot the clustering results for 1D, 2D, and 3D data.
# Function to compute 2D Gaussian ellipses (1-standard-deviation contour)
get_ellipse = function(mu, sigma, npoints = 100) {
  angles = seq(0, 2 * pi, length.out = npoints)
  circle = cbind(cos(angles), sin(angles))  # Unit circle
  chol_decomp = chol(sigma)                 # Upper-triangular R with t(R) %*% R = sigma
  ellipse_points = circle %*% chol_decomp   # Transform the circle to the Gaussian's shape
  ellipse_points = sweep(ellipse_points, 2, mu, "+")  # Shift to the mean
  return(as.data.frame(ellipse_points))
}
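The Cholesky transform above traces the 1-standard-deviation contour. To draw a contour covering a chosen probability mass instead, the unit circle can be scaled first; a minimal sketch of a hypothetical variant (not used in the plots below):
# Hypothetical variant: ellipse covering a chosen probability mass
get_ellipse_level = function(mu, sigma, level = 0.95, npoints = 100) {
  r = sqrt(qchisq(level, df = 2))  # contour radius for the requested mass of a 2D Gaussian
  angles = seq(0, 2 * pi, length.out = npoints)
  circle = r * cbind(cos(angles), sin(angles))
  as.data.frame(sweep(circle %*% chol(sigma), 2, mu, "+"))
}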
# Function to plot 1D, 2D, and 3D GMM results
plot_1D_2D_3D = function(X, gamma, mu, sigma) {
  D = ncol(X)              # Dimension of the data (1, 2, or 3)
  K = nrow(as.matrix(mu))  # Number of clusters (mu coerced to a matrix)
  cluster_labels = apply(gamma, 1, which.max)  # Assign each point to its most likely cluster
  data_plot = as.data.frame(X)
  data_plot$cluster = as.factor(cluster_labels)  # Convert to categorical
  # ---- 1D case ----
  if (D == 1) {
    p = ggplot(data_plot, aes(x = V1, fill = cluster)) +
      geom_histogram(alpha = 0.5, bins = 30, position = "identity") +
      geom_vline(xintercept = as.vector(mu), col = "black", linetype = "dashed", linewidth = 1) +
      labs(title = "1D Gaussian Mixture Clustering", x = "Value", y = "Frequency") +
      theme_minimal()
    return(p)
  # ---- 2D case ----
  } else if (D == 2) {
    p = ggplot(data_plot, aes(x = V1, y = V2, color = cluster)) +
      geom_point(alpha = 0.6) +  # Data points
      geom_point(data = as.data.frame(mu), aes(x = V1, y = V2),
                 color = "black", size = 4, shape = 4) +  # Cluster means
      # Add one dashed Gaussian ellipse per cluster
      lapply(1:K, function(k) {
        ellipse_data = get_ellipse(mu[k, ], sigma[, , k])
        geom_path(data = ellipse_data, aes(x = V1, y = V2), color = "black", linetype = "dashed")
      }) +
      geom_density_2d() +
      labs(title = "2D Gaussian Mixture Clustering", x = "X", y = "Y") +
      theme_minimal()
    return(p)
  # ---- 3D case ----
  } else if (D == 3) {
    # 3D plot using plotly
    plot_3D = plot_ly() %>%
      add_trace(
        data = data_plot,
        x = ~V1, y = ~V2, z = ~V3,
        type = "scatter3d", mode = "markers",
        marker = list(size = 3, opacity = 0.7),
        color = ~cluster  # Ensure color length matches points
      ) %>%
      add_trace(
        data = as.data.frame(mu),
        x = ~V1, y = ~V2, z = ~V3,
        type = "scatter3d", mode = "markers",
        marker = list(size = 6, color = "black", symbol = "x")
      ) %>%
      layout(
        title = "3D Gaussian Mixture Clustering",
        scene = list(xaxis = list(title = "X"),
                     yaxis = list(title = "Y"),
                     zaxis = list(title = "Z"))
      )
    return(plot_3D)
  } else {
    stop("This function only supports 1D, 2D, and 3D data!")
  }
}
reconstruct_image = function(image_path, em_result, output_path = "segmented_image.jpg") {
  # Load the original image
  img = readJPEG(image_path)
  dim_img = dim(img)
  # Assign each pixel to its most likely cluster
  clusters = apply(em_result$gamma, 1, which.max)
  # Reconstruct the image by replacing each pixel with its cluster mean
  segmented_matrix = em_result$mu[clusters, ]
  # Reshape back into the original image dimensions
  segmented_img = array(segmented_matrix, dim = dim_img)
  # Fix orientation: swap rows and columns for display/saving
  segmented_img = aperm(segmented_img, c(2, 1, 3))
  # Save the segmented image
  writeJPEG(segmented_img, output_path)
  # Display the segmented image
  plot(as.cimg(segmented_img))
  # Return invisibly to prevent automatic printing
  invisible(segmented_img)
}
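One caveat worth flagging: writeJPEG expects intensities in [0, 1], and since the EM routine standardizes its input internally, the returned means can fall outside that range. A minimal hedge (a hypothetical helper, not part of the original pipeline) is to clamp before writing:
# Hypothetical helper: clamp cluster means into the valid [0, 1] intensity range
clamp01 = function(m) pmin(pmax(m, 0), 1)
# e.g., inside reconstruct_image: segmented_matrix = clamp01(em_result$mu)[clusters, ]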
plot_log_likelihood = function(log_likelihood_values) {
  iterations = seq_along(log_likelihood_values)
  log_likelihood_df = data.frame(Iteration = iterations, LogLikelihood = log_likelihood_values)
  ggplot(log_likelihood_df, aes(x = Iteration, y = LogLikelihood)) +
    geom_line(color = "blue", linewidth = 1) +
    geom_point(color = "red", size = 2) +
    labs(title = "EM Algorithm Log-Likelihood Convergence",
         x = "Iteration",
         y = "Log-Likelihood") +
    theme_minimal()
}
## EM function
The EM function was implemented following the algorithm described in Pattern Recognition and Machine Learning by Christopher Bishop (2006, Chapter 9). The Expectation-Maximization (EM) algorithm iteratively estimates the parameters of a Gaussian Mixture Model (GMM) by maximizing the likelihood function. In the E-step, it computes the responsibilities \(\gamma_{ik}\), which represent the probability that each data point \(\mathbf{x}_i\) belongs to cluster \(k\):
\[ \gamma_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i | \mathbf{\mu}_k, \mathbf{\Sigma}_k)} {\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i | \mathbf{\mu}_j, \mathbf{\Sigma}_j)} \]
where \(\pi_k\) is the mixing coefficient, \(\mathbf{\mu}_k\) is the mean, and \(\mathbf{\Sigma}_k\) is the covariance matrix of cluster \(k\). In the M-step, the parameters are updated using the computed responsibilities. The new cluster means are estimated as:
\[ \mathbf{\mu}_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \mathbf{x}_i \]
where \(N_k = \sum_{i=1}^{N} \gamma_{ik}\) represents the effective number of points assigned to cluster \(k\). The covariance matrices are updated as:
\[ \mathbf{\Sigma}_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (\mathbf{x}_i - \mathbf{\mu}_k)(\mathbf{x}_i - \mathbf{\mu}_k)^T + \epsilon I \]
where \(\epsilon I\) is a small regularization term added for numerical stability, as suggested in Bishop (2006, Section 9.2.2). The algorithm iterates until the log-likelihood function:
\[ \log L = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_i | \mathbf{\mu}_k, \mathbf{\Sigma}_k) \right) \]
converges. The function provides a numerically stable implementation of EM for 1D, 2D, and higher-dimensional data, including image segmentation.
## Function of EM for 1D
EM_GMM_1D = function(X, K, max_iter = 100, tol = 1e-6) {
  N = length(X)  # Number of data points
  set.seed(234)
  mu = runif(K, min(X), max(X))  # Random initial means within the data range
  sigma = rep(var(X), K)         # Initialize each variance to the sample variance
  pi_k = rep(1/K, K)             # Equal mixing weights initially
  gamma = matrix(0, N, K)        # Responsibility matrix
  log_likelihood_values = numeric(max_iter)  # Store log-likelihood values
  log_likelihood = function() {
    sum(log(rowSums(sapply(1:K, function(k) {
      pi_k[k] * dnorm(X, mean = mu[k], sd = sqrt(sigma[k]))
    }))))
  }
  logL = log_likelihood()
  log_likelihood_values[1] = logL
  for (iter in 2:max_iter) {
    # E-step: compute responsibilities
    for (k in 1:K) {
      gamma[, k] = pi_k[k] * dnorm(X, mean = mu[k], sd = sqrt(sigma[k]))
    }
    gamma = gamma / rowSums(gamma)
    # M-step: update parameters
    Nk = colSums(gamma)           # Effective number of points in each cluster
    pi_k = Nk / N                 # Update mixing coefficients
    mu = colSums(gamma * X) / Nk  # Update means
    sigma = sapply(1:K, function(k) sum(gamma[, k] * (X - mu[k])^2) / Nk[k])  # Update variances
    # Compute the new log-likelihood
    new_logL = log_likelihood()
    log_likelihood_values[iter] = new_logL
    # Check for convergence
    if (abs(new_logL - logL) < tol) {
      message("Converged at iteration ", iter)
      log_likelihood_values = log_likelihood_values[1:iter]
      break
    }
    logL = new_logL
  }
  return(list(mu = mu, sigma = sigma, pi_k = pi_k, gamma = gamma, log_likelihood = log_likelihood_values))
}
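One caveat: the log-likelihood above sums raw densities before taking the log, which can underflow for large N or well-separated clusters. A common remedy is the log-sum-exp trick; below is a minimal base-R sketch of a stabler variant for the 1D case (not used in the runs that follow):
# Hypothetical numerically stabler log-likelihood via log-sum-exp (1D case)
log_likelihood_stable = function(X, mu, sigma, pi_k, K) {
  # N x K matrix of log(pi_k) + log N(x_i | mu_k, sigma_k)
  log_dens = sapply(1:K, function(k) {
    log(pi_k[k]) + dnorm(X, mean = mu[k], sd = sqrt(sigma[k]), log = TRUE)
  })
  m = apply(log_dens, 1, max)               # row-wise maximum
  sum(m + log(rowSums(exp(log_dens - m))))  # log-sum-exp per observation, then total
}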
EM_GMM = function(X, K, max_iter = 100, tol = 1e-6) {
  # Check if X is a single-column matrix (1D case)
  if (ncol(X) == 1) {
    print("1D detected, calling EM_GMM_1D")
    return(EM_GMM_1D(as.vector(X), K, max_iter, tol))  # Convert to vector for the 1D function
  }
  N = nrow(X)  # Number of data points
  D = ncol(X)  # Number of dimensions
  # Standardize the data first, so the initial means (drawn from X) live in the same space
  X = scale(X)
  params = initialize_em_parameters(X, K)
  mu = params$mu
  sigma = params$sigma
  pi_k = params$pi_k
  gamma = matrix(0, N, K)  # Responsibility matrix
  log_likelihood_values = numeric(max_iter)  # Store log-likelihood values
  log_likelihood = function() {
    sum(log(rowSums(sapply(1:K, function(k) {
      pi_k[k] * dmvnorm(X, mean = as.numeric(mu[k, , drop = FALSE]), sigma = sigma[, , k])
    }))))
  }
  logL = log_likelihood()
  log_likelihood_values[1] = logL
  for (iter in 2:max_iter) {
    # E-step: compute responsibilities
    for (k in 1:K) {
      gamma[, k] = pi_k[k] * dmvnorm(X, mean = as.numeric(mu[k, , drop = FALSE]), sigma = sigma[, , k])
    }
    gamma = gamma / rowSums(gamma)
    # M-step: update parameters
    Nk = colSums(gamma)
    pi_k = Nk / N
    mu = t(gamma) %*% X / Nk  # K x D matrix of updated means
    for (k in 1:K) {
      X_centered = sweep(X, 2, mu[k, , drop = FALSE], FUN = "-")
      # Weighted covariance plus a small ridge for numerical stability
      sigma[, , k] = t(X_centered) %*% (X_centered * gamma[, k]) / Nk[k] + diag(1e-6, D)
    }
    # Compute the new log-likelihood and check for convergence
    new_logL = log_likelihood()
    log_likelihood_values[iter] = new_logL
    if (abs(new_logL - logL) < tol) {
      message("Converged at iteration ", iter)
      log_likelihood_values = log_likelihood_values[1:iter]
      break
    }
    logL = new_logL
  }
  return(list(mu = mu, sigma = sigma, pi_k = pi_k, gamma = gamma, log_likelihood = log_likelihood_values))
}
# Tests
The first test uses 1D data: a single vector containing samples drawn from two different Gaussian populations.
set.seed(123)
X_1D = matrix(c(rnorm(100, mean = -2, sd = 1),
rnorm(100, mean = 3, sd = 1.5)), ncol = 1)
# Run EM Algorithm for 1D data
result_1D = EM_GMM(X_1D, K = 2)
## [1] "1D detected, calling EM_GMM_1D"
## Converged at iteration 59
# Plot the clustering result
plot_1D_2D_3D(X_1D, result_1D$gamma, result_1D$mu, result_1D$sigma)
plot_log_likelihood(result_1D$log_likelihood)
Next, we test the multivariate version on a matrix of 2D data drawn from two different populations.
# Generate synthetic 2D data
set.seed(123)
X_2D = rbind(
mvrnorm(n = 100, mu = c(-2, -2), Sigma = matrix(c(1, 0.5, 0.5, 1), ncol = 2)),
mvrnorm(n = 100, mu = c(3, 3), Sigma = matrix(c(1, -0.3, -0.3, 1), ncol = 2))
)
# Run EM Algorithm for 2D data
result_2D = EM_GMM(X_2D, K = 2)
## Converged at iteration 11
# Plot the clustering results
plot_1D_2D_3D(scale(X_2D), result_2D$gamma, result_2D$mu, result_2D$sigma)
plot_log_likelihood(result_2D$log_likelihood)
# ---- 3D test: three Gaussian clusters with known parameters ----
true_means = matrix(c(5, 5, 5, -5, -5, -5, 5, -5, 5), ncol = 3, byrow = TRUE)
true_covariances = list(
diag(c(2, 1, 1)), # Cluster 1
diag(c(1, 2, 1)), # Cluster 2
diag(c(1, 1, 2)) # Cluster 3
)
# Generate data from 3 different Gaussians
X_list = lapply(1:3, function(k) {
mvrnorm(n = 300 / 3, mu = true_means[k,], Sigma = true_covariances[[k]])
})
# Combine all generated data
X_3D = do.call(rbind, X_list)
colnames(X_3D) = c("V1", "V2", "V3")
em_results = EM_GMM(X_3D, K = 3)
## Converged at iteration 12
# Extract fitted parameters
gamma_est = em_results$gamma
mu_est = em_results$mu
sigma_est = em_results$sigma
# -----------------------
# STEP 3: Plot using plot_1D_2D_3D()
# -----------------------
plot_1D_2D_3D(scale(X_3D), gamma_est, mu_est, sigma_est)
The images below show how the Expectation-Maximization (EM) algorithm groups pixels with similar colors to segment different parts of an image.
Since EM models the pixel distribution using a Gaussian Mixture Model (GMM), it clusters pixels based on their RGB values. Here are some key takeaways:
- Pixels are grouped purely by color similarity, so regions with clearly distinct colors are segmented cleanly.
- The number of clusters K controls how coarse or fine the segmentation is.
- Smooth color gradients and soft boundaries are harder to separate, since neighboring pixels have very similar RGB values.
Overall, EM can be useful for image segmentation, but its effectiveness depends on how well colors form separate groups. When boundaries between colors are less defined, EM may not perform as expected.
# Load and reshape the image
img = readJPEG("imagen_assignment2/20BT.jpg")
dim_img = dim(img)
# Reshape into N × 3 (RGB) matrix
img_reshaped = cbind(as.vector(img[,,1]), as.vector(img[,,2]), as.vector(img[,,3]))
# Apply EM algorithm
result = EM_GMM(img_reshaped, K = 3)
reconstruct_image("C:/Users/flore/Desktop/git/Statistical_learning/Test_alpha/20BT.jpg",result)
# Load and reshape the image
img2 = readJPEG("imagen_assignment2/Melanoma.jpg")
dim_img = dim(img2)
# Reshape into N × 3 (RGB) matrix
img_reshaped2 = cbind(as.vector(img2[,,1]), as.vector(img2[,,2]), as.vector(img2[,,3]))
# Apply EM algorithm
result2 = EM_GMM(img_reshaped2, K = 2)
## Converged at iteration 65
reconstruct_image("imagen_assignment2/Melanoma.jpg",result2)
## Warning in as.cimg.array(segmented_img): Assuming third dimension corresponds
## to colour
# Load and reshape the image
img2 = readJPEG("imagen_assignment2/test_1.jpeg")
dim_img = dim(img2)
# Reshape into N × 3 (RGB) matrix
img_reshaped2 = cbind(as.vector(img2[,,1]), as.vector(img2[,,2]), as.vector(img2[,,3]))
# Apply EM algorithm
result2 = EM_GMM(img_reshaped2, K = 5)
reconstruct_image("imagen_assignment2/test_1.jpeg",result2)
## Classification with EM
In this part, we test whether the EM algorithm can be used for classification: we first reduce the dimensionality of the images with PCA and then apply EM to the projected data to form the classes.
### Read Data functions
read_all_images = function(folder_path) {
  image_files = list.files(folder_path, full.names = TRUE, pattern = "\\.(jpg|png|jpeg|tiff|bmp)$", ignore.case = TRUE)
  # Sort filenames in natural order (1, 2, ..., 10 rather than 1, 10, 2)
  image_files = mixedsort(image_files)
  # Extract the numeric label from each filename
  extract_label = function(filename) {
    base_name = tools::file_path_sans_ext(basename(filename))
    label = gsub("[^0-9]", "", base_name)  # Keep only the numeric part
    return(label)
  }
  extract_filename = function(filepath) {
    return(basename(filepath))
  }
  # Flatten one image into a single row: all red values, then green, then blue
  read_data = function(image_path) {
    img = readImage(image_path)
    red_aux = as.vector(img[, , 1])
    green_aux = as.vector(img[, , 2])
    blue_aux = as.vector(img[, , 3])
    img_vector = c(red_aux, green_aux, blue_aux)
    return(as.data.frame(t(img_vector)))  # Transpose to make it a row
  }
  image_list = lapply(image_files, function(file) {
    img_data = read_data(file)
    img_data$Label = extract_label(file)
    img_data$ID = extract_filename(file)
    return(img_data)
  })
  ax = do.call(rbind, image_list)
  return(ax)
}
### PCA Function
PCA.fun = function(X, matrix_bool = FALSE) {
  if (!matrix_bool) {
    X = X %>% dplyr::select(-c("Label", "ID"))
    print(dim(X))
    X = as.matrix(X)
  } else {
    X = X[, -ncol(X)]  # Drop the label column
    print(dim(X))
  }
  n = nrow(X)
  mu = colMeans(X)
  # Center the data
  G = sweep(X, 2, mu, FUN = "-")
  # Compute the n x n Gram matrix instead of the p x p covariance (here n << p)
  cov_matrix = (1/(n-1)) * (G %*% t(G))
  # Eigendecomposition, then map the eigenvectors back to feature space
  eig = eigen(cov_matrix)
  eig_val = eig$values
  eig_vec = eig$vectors
  ei_vec_large = t(G) %*% eig_vec
  # Sort by decreasing eigenvalue
  sort_index = order(eig_val, decreasing = TRUE)
  eig_val_sorted = eig_val[sort_index]
  eig_vec_sorted = ei_vec_large[, sort_index]
  # Proportion of variance explained by each PC
  D = eig_val_sorted / sum(eig_val_sorted)
  return(list("Eigen Vector" = eig_vec_sorted, "D" = D))
}
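Because n is much smaller than p here (12 images with 108,000 features each), the function eigendecomposes the small n x n Gram matrix and maps the eigenvectors back through t(G), rather than forming the huge p x p covariance. As a sanity check, the resulting projections should agree with prcomp up to sign and scale; a minimal sketch on toy data (X_toy and its sizes are illustrative only):
# Hypothetical check: the Gram-matrix shortcut should agree with prcomp (up to sign)
set.seed(7)
X_toy = cbind(matrix(rnorm(10 * 50), nrow = 10), dummy = 0)  # last column mimics the label column
res_toy = PCA.fun(X_toy, matrix_bool = TRUE)                 # prints 10 x 50
G_toy = scale(X_toy[, -51], scale = FALSE)                   # center the features
proj = G_toy %*% res_toy$"Eigen Vector"[, 1:2]               # project onto the first 2 PCs
pc = prcomp(X_toy[, -51])$x[, 1:2]                           # reference scores
cor(proj[, 1], pc[, 1])                                      # should be close to +1 or -1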
### Load data
folder_path = "Training_alpha"
data = read_all_images(folder_path)
labels = data$Label
image_names = data$ID
# Convert to a numeric matrix (dropping the ID column) for the PCA projection
data_asmatrix = as.matrix(data %>% dplyr::select(-ID))
data_asmatrix = apply(data_asmatrix, 2, as.numeric)
PCA_result = PCA.fun(data)
## [1] 12 108000
eig_vecs = PCA_result$"Eigen Vector"
cumulative_variance = cumsum(PCA_result$D)
num_components = which(cumulative_variance >= 0.95)[1]
# Project data onto selected PCs
data_reduced_PCA = data_asmatrix[,-ncol(data_asmatrix)] %*% eig_vecs[, 1:num_components]
data_reduced_PCA = as.data.frame(apply(data_reduced_PCA, 2, as.numeric))
data_reduced_PCA_labeled = cbind(data_reduced_PCA, labels)
set.seed(123)
em_pca_results = EM_GMM(as.matrix(data_reduced_PCA), K = 2)
## Converged at iteration 8
# Hard cluster assignments from EM (note: this overwrites the true labels vector)
labels = apply(em_pca_results$gamma, 1, which.max)
data_reduced_PCA_labeled$EM_labels = as.factor(labels)
data_reduced_PCA_labeled
## V1 V2 V3 labels EM_labels
## 1 -10382.270 912.81077 735.82619 1 2
## 2 -10234.995 914.03785 761.83617 1 2
## 3 -11250.989 257.37601 -254.62464 1 1
## 4 -11156.964 336.70711 -163.18099 1 1
## 5 -11236.863 255.30750 -134.01965 1 1
## 6 -11352.240 304.26472 -172.13487 1 1
## 7 8950.352 118.26988 263.67167 2 2
## 8 9026.702 -186.90475 378.53420 2 2
## 9 8895.656 15.43497 325.94968 2 2
## 10 8548.154 816.27236 -45.80203 2 2
## 11 7832.108 1085.73597 -195.14015 2 2
## 12 7351.097 1252.83305 -156.06647 2 2
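A quick way to quantify the agreement visible in the table above is a confusion matrix between the true labels and the EM assignments (a one-line check on the columns of the data frame just printed; note that cluster numbering is arbitrary, so a consistently swapped labeling still counts as agreement):
# Cross-tabulate true labels (rows) against EM cluster assignments (columns)
table(True = data_reduced_PCA_labeled$labels, EM = data_reduced_PCA_labeled$EM_labels)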
# Scale V1, V2, and V3
data_reduced_PCA_labeled_scaled = data_reduced_PCA_labeled
data_reduced_PCA_labeled_scaled[, c("V1", "V2", "V3")] = scale(data_reduced_PCA_labeled[, c("V1", "V2", "V3")])
# Create 3D plot using plotly
plot_ly(data_reduced_PCA_labeled_scaled,
x = ~V1,
y = ~V2,
z = ~V3,
color = ~labels,
colors = c("blue", "red"),
type = 'scatter3d',
mode = 'markers',
marker = list(size = 5, opacity = 0.7)) %>%
layout(title = "3D PCA-Reduced Data Visualization",
scene = list(xaxis = list(title = 'V1'),
yaxis = list(title = 'V2'),
zaxis = list(title = 'V3')))
# Generate random samples from both Gaussians
set.seed(42)
samples1 = mvrnorm(n = 100, mu = em_pca_results$mu[1,], Sigma = em_pca_results$sigma[,,1])
samples2 = mvrnorm(n = 100, mu = em_pca_results$mu[2,], Sigma = em_pca_results$sigma[,,2])
# Create a dataframe with both Gaussian distributions
data_gaussians = data.frame(
V1 = c(samples1[,1], samples2[,1]),
V2 = c(samples1[,2], samples2[,2]),
V3 = c(samples1[,3], samples2[,3]),
labels = factor(c(rep(1, 100), rep(2, 100))) # Labels for the two distributions
)
# Plot the Gaussians in 3D using V1, V2, and V3
plot_ly() %>%
# Plot the first Gaussian (Gaussian 1)
add_trace(data = data_gaussians[data_gaussians$labels == 1, ],
x = ~V1, y = ~V2, z = ~V3,
color = I("darkblue"), type = 'scatter3d',
mode = 'markers', marker = list(size = 6, opacity = 0.5),
name = "Samples from EM Gaussian 1") %>%
# Plot the second Gaussian (Gaussian 2)
add_trace(data = data_gaussians[data_gaussians$labels == 2, ],
x = ~V1, y = ~V2, z = ~V3,
color = I("darkred"), type = 'scatter3d',
mode = 'markers', marker = list(size = 6, opacity = 0.5),
name = "Samples from EM Gaussian 2") %>%
add_trace(data = data_reduced_PCA_labeled_scaled[data_reduced_PCA_labeled_scaled$labels == 1, ],
x = ~V1, y = ~V2, z = ~V3,
color = I("orange"), type = 'scatter3d',
mode = 'markers', marker = list(size = 6, opacity = 0.5),
name = "Image Points True ID = 1") %>%
add_trace(data = data_reduced_PCA_labeled_scaled[data_reduced_PCA_labeled_scaled$labels == 2, ],
x = ~V1, y = ~V2, z = ~V3,
color = I("purple"), type = 'scatter3d',
mode = 'markers', marker = list(size = 6, opacity = 0.5),
name = "Image Points True ID = 2") %>%
layout(title = "3D Gaussian Distributions",
scene = list(
xaxis = list(title = 'V1'),
yaxis = list(title = 'V2'),
zaxis = list(title = 'V3')
))
Expectation-Maximization (EM) is a good algorithm for clustering but does not perform well when applied directly to classification. The primary problem is that EM clusters data points without knowing their true class labels, so one cluster may contain more than one true class, which leads to misclassification. Moreover, EM assumes the data follow a mixture of Gaussians, while real-world datasets often have intricate structures that this assumption does not capture. Because it groups data by feature similarity instead of learning class boundaries, its clusters may not correspond to the actual categories. Although EM is helpful for identifying patterns in unlabeled data, it is not a good classifier unless supplemented with further steps that attach meaningful labels to its clusters.
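One such supplementary step is majority-vote relabeling: given a handful of labeled examples, each EM cluster is assigned the most common true label among its members before measuring accuracy. A minimal sketch using the objects above (a hedged illustration, not part of the original pipeline):
# Hypothetical majority-vote mapping from EM clusters to true labels
true_lab = data_reduced_PCA_labeled$labels
em_lab = data_reduced_PCA_labeled$EM_labels
cluster_to_label = tapply(true_lab, em_lab, function(l) names(which.max(table(l))))
predicted = cluster_to_label[as.character(em_lab)]
mean(predicted == true_lab)  # share of correctly relabeled points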
We analyzed the Expectation-Maximization algorithm and its applications in clustering, image segmentation, and classification, and the results reflect both its strengths and its limitations. EM is an efficient tool for finding patterns in unlabeled data, which makes it well suited to clustering. In image segmentation, it separates regions based on pixel similarity, but it loses precision where colors transition gradually. For classification, EM falls short because it does not learn class labels directly and often produces uncertain or incorrect groupings. These results confirm that, despite its importance in unsupervised learning, EM is not the best choice when sharp decision boundaries or accurate classification are required. When designing such systems, it is important to weigh the strengths and weaknesses of EM for the practical use at hand.